9 research outputs found
The Journey is the Reward: Unsupervised Learning of Influential Trajectories
Unsupervised exploration and representation learning become increasingly
important when learning in diverse and sparse environments. The
information-theoretic principle of empowerment formalizes an unsupervised
exploration objective through an agent trying to maximize its influence on the
future states of its environment. Previous approaches carry certain limitations
in that they either do not employ closed-loop feedback or do not have an
internal state. As a consequence, a privileged final state is taken as an
influence measure, rather than the full trajectory. We provide a model-free
method which takes into account the whole trajectory while still offering the
benefits of option-based approaches. We successfully apply our approach to
settings with large action spaces, where discovery of meaningful action
sequences is particularly difficult. Comment: ICML'19 ERL Workshop
Multimodal Transitions for Generative Stochastic Networks
Generative Stochastic Networks (GSNs) have been recently introduced as an
alternative to traditional probabilistic modeling: instead of parametrizing the
data distribution directly, one parametrizes a transition operator for a Markov
chain whose stationary distribution is an estimator of the data generating
distribution. The result of training is therefore a machine that generates
samples through this Markov chain. However, the previously introduced GSN
consistency theorems suggest that in order to capture a wide class of
distributions, the transition operator in general should be multimodal,
something that has not been done before this paper. We introduce for the first
time multimodal transition distributions for GSNs, in particular using models
in the NADE family (Neural Autoregressive Density Estimator) as output
distributions of the transition operator. A NADE model is related to an RBM
(and can thus model multimodal distributions) but its likelihood (and
likelihood gradient) can be computed easily. The parameters of the NADE are
obtained as a learned function of the previous state of the learned Markov
chain. Experiments clearly illustrate the advantage of such multimodal
transition distributions over unimodal GSNs. Comment: 7 figures, 9 pages, submitted to ICLR14
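To make the sampling procedure concrete, here is a minimal sketch (not the authors' code) of drawing from a GSN-style learned Markov chain: corrupt the current state, then sample the next state from the transition operator's output distribution. For brevity the output here is a factorized Bernoulli; the point of the paper is that a multimodal output model such as a NADE should replace it. The network sizes, corruption process, and chain length are all placeholder assumptions.

```python
import torch
import torch.nn as nn

# Transition operator: maps a corrupted state to the parameters of the
# next-state distribution. A factorized Bernoulli output is used here only
# for brevity; the paper argues for a multimodal output such as a NADE.
transition = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 784))

def gsn_sample(n_steps=100):
    x = torch.bernoulli(torch.full((1, 784), 0.5))                  # arbitrary initial state
    for _ in range(n_steps):
        x_corrupted = x * torch.bernoulli(torch.full_like(x, 0.9))  # simple masking noise
        probs = torch.sigmoid(transition(x_corrupted))              # parameters of P(next state | corrupted state)
        x = torch.bernoulli(probs)                                  # one step of the Markov chain
    return x

sample = gsn_sample()   # after many steps, x approximates a draw from the stationary distribution
```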
Deep Directed Generative Autoencoders
For discrete data, the likelihood can be rewritten exactly and
parametrized into $P(X=x) = P(X=x \mid H=f(x))\,P(H=f(x))$ if
$P(X=x \mid H=f(x))$ has enough capacity to put no probability mass on any
$x'$ for which $f(x') \neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the
first factor gives rise to the log-likelihood reconstruction error of an
autoencoder with $f(\cdot)$ as the encoder and $P(X \mid H)$ as the (probabilistic)
decoder. The log of the second term can be seen as a regularizer on the encoded
activations $h = f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder
can be represented by a deep neural network and trained to maximize the average
of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder
$f(\cdot)$ that maps $X$ to $f(X)$, which has a much simpler distribution than $X$
itself, estimated by $P(H)$. This "flattens the manifold" or concentrates
probability mass in a smaller number of (relevant) dimensions over which the
distribution factorizes. Generating samples from the model is straightforward
using ancestral sampling. One challenge is that regular back-propagation cannot
be used to obtain the gradient on the parameters of the encoder, but we find
that using the straight-through estimator works well here. We also find that
although optimizing a single level of such architecture may be difficult, much
better results can be obtained by pre-training and stacking them, gradually
transforming the data distribution into one that is more easily captured by a
simple parametric model.
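The straight-through estimator mentioned above can be illustrated with a small sketch, assuming a binarizing encoder and a Bernoulli decoder; this is not the authors' implementation, and the layer sizes and toy data are placeholders. The hard threshold is applied in the forward pass, while its gradient is replaced by the identity in the backward pass, so the reconstruction term log P(X = x | H = f(x)) can still be optimized by back-propagation.

```python
import torch
import torch.nn as nn

class StraightThroughBinarize(torch.autograd.Function):
    """Hard threshold in the forward pass, identity gradient in the backward pass."""
    @staticmethod
    def forward(ctx, logits):
        return (logits > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pass the gradient as if the threshold were the identity.
        return grad_output

class DiscreteAutoencoder(nn.Module):
    def __init__(self, n_in=784, n_hidden=200):
        super().__init__()
        self.encoder = nn.Linear(n_in, n_hidden)   # f(x): deterministic, discrete after thresholding
        self.decoder = nn.Linear(n_hidden, n_in)   # P(X | H): Bernoulli logits

    def forward(self, x):
        h = StraightThroughBinarize.apply(self.encoder(x))   # h = f(x) in {0, 1}^n_hidden
        return self.decoder(h), h

# Reconstruction term log P(X = x | H = f(x)); the regularizer on P(H) would be added separately.
model = DiscreteAutoencoder()
x = torch.rand(32, 784).round()                   # toy binary data
logits, h = model(x)
recon_nll = nn.functional.binary_cross_entropy_with_logits(logits, x)
recon_nll.backward()                              # gradients reach the encoder via the straight-through estimator
```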
On Variational Bounds of Mutual Information
Estimating and optimizing Mutual Information (MI) is core to many problems in
machine learning; however, bounding MI in high dimensions is challenging. To
establish tractable and scalable objectives, recent work has turned to
variational bounds parameterized by neural networks, but the relationships and
tradeoffs between these bounds remain unclear. In this work, we unify these
recent developments in a single framework. We find that the existing
variational lower bounds degrade when the MI is large, exhibiting either high
bias or high variance. To address this problem, we introduce a continuum of
lower bounds that encompasses previous bounds and flexibly trades off bias and
variance. On high-dimensional, controlled problems, we empirically characterize
the bias and variance of the bounds and their gradients and demonstrate the
effectiveness of our new bounds for estimation and representation learning. Comment: ICML 2019
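As one concrete member of this family of variational lower bounds, the sketch below computes an InfoNCE-style bound from a batch of critic scores; the separable critic, dimensions, and toy data are illustrative assumptions rather than the paper's experimental setup. Note that this particular bound saturates at log(batch size), one instance of the bias/variance trade-off discussed above.

```python
import torch
import torch.nn as nn

def infonce_lower_bound(scores):
    """InfoNCE bound on I(X; Y) from a [batch, batch] matrix of critic scores f(x_i, y_j).

    The bound is E[ f(x_i, y_i) - log mean_j exp f(x_i, y_j) ], which cannot
    exceed log(batch), hence low variance but high bias when MI is large.
    """
    batch = scores.shape[0]
    positives = scores.diag()
    log_mean_exp = torch.logsumexp(scores, dim=1) - torch.log(torch.tensor(float(batch)))
    return (positives - log_mean_exp).mean()

# Separable critic f(x, y) = g(x)^T h(y); sizes are placeholders.
g = nn.Linear(10, 32)
h = nn.Linear(10, 32)
x = torch.randn(128, 10)
y = x + 0.1 * torch.randn(128, 10)          # correlated pair, so MI > 0
scores = g(x) @ h(y).t()                    # scores[i, j] = f(x_i, y_j)
mi_estimate = infonce_lower_bound(scores)
```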
SketchTransfer: A Challenging New Task for Exploring Detail-Invariance and the Abstractions Learned by Deep Networks
Deep networks have achieved excellent results in perceptual tasks, yet their
ability to generalize to variations not seen during training has come under
increasing scrutiny. In this work we focus on their ability to have invariance
towards the presence or absence of details. For example, humans are able to
watch cartoons, which are missing many visual details, without being explicitly
trained to do so. As another example, 3D rendering software is a relatively
recent development, yet people are able to understand such rendered scenes even
though they are missing details (consider a film like Toy Story). The failure
of machine learning algorithms to do this indicates a significant gap in
generalization between human abilities and the abilities of deep networks. We
propose a dataset that will make it easier to study the detail-invariance
problem concretely. We produce a concrete task for this: SketchTransfer, and we
show that state-of-the-art domain transfer algorithms still struggle with this
task. The state-of-the-art technique which achieves over 95% on MNIST → SVHN
transfer only achieves 59% accuracy on the
SketchTransfer task, which is much better than random (11% accuracy) but falls
short of the 87% accuracy of a classifier trained directly on labeled
sketches. This indicates that this task is approachable with today's best
methods but has substantial room for improvement. Comment: Accepted WACV 2020
Maximum Entropy Generators for Energy-Based Models
Maximum likelihood estimation of energy-based models is a challenging problem
due to the intractability of the log-likelihood gradient. In this work, we
propose learning both the energy function and an amortized approximate sampling
mechanism using a neural generator network, which provides an efficient
approximation of the log-likelihood gradient. The resulting objective requires
maximizing entropy of the generated samples, which we perform using recently
proposed nonparametric mutual information estimators. Finally, to stabilize the
resulting adversarial game, we use a zero-centered gradient penalty derived as
a necessary condition from the score matching literature. The proposed
technique can generate sharp images with Inception and FID scores competitive
with recent GAN techniques, does not suffer from mode collapse, and is
competitive with state-of-the-art anomaly detection techniques.
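A heavily simplified sketch of the two training signals described above is given below; it is an illustration under stated assumptions, not the authors' code. The generator's entropy term is stood in for by a generic neural mutual-information estimate mi_estimate(z, G(z)) (for example a MINE-style bound, as in the next entry), and all modules, dimensions, and weights are placeholders.

```python
import torch
import torch.nn as nn

# Heavily simplified sketch of the two objectives, not the authors' implementation.
E = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))   # energy function E(x)
G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))   # amortized sampler G(z)

def energy_step(x_real, x_fake, gp_weight=1.0):
    x_real = x_real.clone().requires_grad_(True)
    e_real = E(x_real)
    # Approximate maximum-likelihood update: lower the energy of data, raise it for samples.
    loss = e_real.mean() - E(x_fake.detach()).mean()
    # Zero-centered gradient penalty on data points, used to stabilize the adversarial game.
    grad = torch.autograd.grad(e_real.sum(), x_real, create_graph=True)[0]
    return loss + gp_weight * grad.pow(2).sum(dim=1).mean()

def generator_step(z, x_fake, mi_estimate):
    # Push samples toward low energy while keeping them diverse: the entropy of G
    # is encouraged through a neural estimate of I(z; G(z)); mi_estimate is a stand-in.
    return E(x_fake).mean() - mi_estimate(z, x_fake)

z = torch.randn(64, 8)
x_real = torch.randn(64, 2)                 # toy "data"
e_loss = energy_step(x_real, G(z))          # would be followed by an optimizer step on E
```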
MINE: Mutual Information Neural Estimation
We argue that the estimation of mutual information between high dimensional
continuous random variables can be achieved by gradient descent over neural
networks. We present a Mutual Information Neural Estimator (MINE) that is
linearly scalable in dimensionality as well as in sample size, trainable
through back-prop, and strongly consistent. We present a handful of
applications on which MINE can be used to minimize or maximize mutual
information. We apply MINE to improve adversarially trained generative models.
We also use MINE to implement Information Bottleneck, applying it to supervised
classification; our results demonstrate substantial improvement in flexibility
and performance in these settings. Comment: 19 pages, 6 figures
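The abstract does not spell out the estimator; the sketch below assumes the Donsker-Varadhan representation of the KL divergence, which is the lower bound MINE is built on, and omits the paper's bias-corrected gradient. The statistics network, dimensions, and toy data are placeholders.

```python
import torch
import torch.nn as nn

# Donsker-Varadhan bound: I(X; Y) >= E_p(x,y)[T(x, y)] - log E_p(x)p(y)[exp T(x, y)],
# with T a small "statistics network" trained by gradient ascent on the bound.
class StatisticsNetwork(nn.Module):
    def __init__(self, dim_x, dim_y, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1)).squeeze(1)

def dv_lower_bound(T, x, y):
    joint = T(x, y).mean()                                # samples from p(x, y)
    y_shuffled = y[torch.randperm(y.shape[0])]            # approximate samples from p(x)p(y)
    marginal = torch.logsumexp(T(x, y_shuffled), dim=0) - torch.log(torch.tensor(float(x.shape[0])))
    return joint - marginal

T = StatisticsNetwork(5, 5)
opt = torch.optim.Adam(T.parameters(), lr=1e-3)
for _ in range(200):                                      # maximize the bound by gradient ascent
    x = torch.randn(256, 5)
    y = x + 0.5 * torch.randn(256, 5)                     # correlated toy pair
    loss = -dv_lower_bound(T, x, y)
    opt.zero_grad(); loss.backward(); opt.step()
```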
Generative Adversarial Networks
We propose a new framework for estimating generative models via an
adversarial process, in which we simultaneously train two models: a generative
model G that captures the data distribution, and a discriminative model D that
estimates the probability that a sample came from the training data rather than
G. The training procedure for G is to maximize the probability of D making a
mistake. This framework corresponds to a minimax two-player game. In the space
of arbitrary functions G and D, a unique solution exists, with G recovering the
training data distribution and D equal to 1/2 everywhere. In the case where G
and D are defined by multilayer perceptrons, the entire system can be trained
with backpropagation. There is no need for any Markov chains or unrolled
approximate inference networks during either training or generation of samples.
Experiments demonstrate the potential of the framework through qualitative and
quantitative evaluation of the generated samples.
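The two-player game can be summarized in a short training-loop sketch on toy one-dimensional data; the architectures, hyperparameters, and the non-saturating form of the generator loss are illustrative choices, not the paper's exact setup.

```python
import torch
import torch.nn as nn

# Minimal GAN training loop on toy 1-D data; all sizes and data are placeholders.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))   # generator: noise -> sample
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator: sample -> logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 1) * 0.5 + 2.0            # toy "data distribution"
    fake = G(torch.randn(64, 8))

    # Discriminator step: distinguish real (label 1) from generated (label 0) samples.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: maximize the probability of D making a mistake
    # (non-saturating variant of the minimax objective).
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```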
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
We show that an end-to-end deep learning approach can be used to recognize
either English or Mandarin Chinese speech--two vastly different languages.
Because it replaces entire pipelines of hand-engineered components with neural
networks, end-to-end learning allows us to handle a diverse variety of speech
including noisy environments, accents and different languages. Key to our
approach is our application of HPC techniques, resulting in a 7x speedup over
our previous system. Because of this efficiency, experiments that previously
took weeks now run in days. This enables us to iterate more quickly to identify
superior architectures and algorithms. As a result, in several cases, our
system is competitive with the transcription of human workers when benchmarked
on standard datasets. Finally, using a technique called Batch Dispatch with
GPUs in the data center, we show that our system can be inexpensively deployed
in an online setting, delivering low latency when serving users at scale.